This report is for ETC5521 Assignment 1 by Team Echidna, comprising Ruimin Lin and Rahul Bharadwaj.

Introduction and Motivation

Board games are a form of leisure that people have enjoyed since long before computers and video games existed, and they have evolved enormously since their inception. Board games give people a way to socialize, relieve stress in a fast-moving society, and exercise the brain. Given that they remain a popular choice of leisure, what makes board games great? Why have board games survived in a world of virtual reality games? In other words, what are the common characteristics of top-ranked board games? What are the best board games in terms of average rating?

The original board games data used in this report was obtained from the Board Game Geek database, and was cleaned and shared by Thomas Mock.

The tidy dataset consists of 22 variables (columns) and 10532 observations (rows). It includes information such as maximum and minimum playtime, maximum and minimum number of players, the minimum age of players, game designer, game publisher, game mechanics and more. Note that even though the data set is tidy, the values of variables such as category, family and mechanic are still messy and repetitive, which may limit our ability to explore these variables.

Data Description

The aim of this exploratory analysis is to find out which factors affect the average rating of board games. This gives insight into which board games are most popular and which characteristics they share. We have therefore formulated the following questions to guide further exploration of the board games data.

Primary Question:

What are the common characteristics of top ranked board games?

Secondary Questions:

  1. What are the top 10 ranked board games?
  2. How do variables like min/max playtime, min/max players, or min_age affect the average rating?
  3. Which game designer was most successful in producing popular games? Which publisher published the most popular games?

The variables included in the data are as follows:

## Rows: 10,532
## Columns: 22
## $ game_id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ description    <chr> "Die Macher is a game about seven sequential politic...
## $ image          <chr> "//cf.geekdo-images.com/images/pic159509.jpg", "//cf...
## $ max_players    <dbl> 5, 4, 4, 4, 6, 6, 2, 5, 4, 6, 7, 5, 4, 4, 6, 4, 2, 8...
## $ max_playtime   <dbl> 240, 30, 60, 60, 90, 240, 20, 120, 90, 60, 45, 60, 1...
## $ min_age        <dbl> 14, 12, 10, 12, 12, 12, 8, 12, 13, 10, 13, 12, 10, 1...
## $ min_players    <dbl> 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 2, 2...
## $ min_playtime   <dbl> 240, 30, 30, 60, 90, 240, 20, 120, 90, 60, 45, 45, 6...
## $ name           <chr> "Die Macher", "Dragonmaster", "Samurai", "Tal der Kö...
## $ playing_time   <dbl> 240, 30, 60, 60, 90, 240, 20, 120, 90, 60, 45, 60, 1...
## $ thumbnail      <chr> "//cf.geekdo-images.com/images/pic159509_t.jpg", "//...
## $ year_published <dbl> 1986, 1981, 1998, 1992, 1964, 1989, 1978, 1993, 1998...
## $ artist         <chr> "Marcus Gschwendtner", "Bob Pepper", "Franz Vohwinke...
## $ category       <chr> "Economic,Negotiation,Political", "Card Game,Fantasy...
## $ compilation    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "CAT...
## $ designer       <chr> "Karl-Heinz Schmiel", "G. W. \"Jerry\" D'Arcey", "Re...
## $ expansion      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Elfengold,Elfen...
## $ family         <chr> "Country: Germany,Valley Games Classic Line", "Anima...
## $ mechanic       <chr> "Area Control / Area Influence,Auction/Bidding,Dice ...
## $ publisher      <chr> "Hans im Glück Verlags-GmbH,Moskito Spiele,Valley Ga...
## $ average_rating <dbl> 7.66508, 6.60815, 7.44119, 6.60675, 7.35830, 6.52534...
## $ users_rated    <dbl> 4498, 478, 12019, 314, 15195, 73, 2751, 186, 1263, 6...

The variable names and types are listed above to give a better understanding of the board games data set.

To ensure the reliability of the board game ratings, the data is limited to games with at least 50 ratings that were published between 1950 and 2016. The site’s database has more than 90,000 games with crowd-sourced ratings.

The original board games data set consists of 90400 observations and 80 variables, so data cleaning and wrangling are necessary before analysis. Thomas replaced long and complicated variable names such as details.description in the original data with short names such as description, using janitor::clean_names and the set_names function, which avoids messy code. In addition, he eliminated around 50 variables using the select function, which leaves 27 variables at this stage.

The data set is then filtered to board games published from 1950 to 2016 that have been rated by at least 50 users. Observations with ‘NA’ in year_published are also omitted. Thomas then excludes variables that may not be useful for the analysis, such as attributes_total and game_type, which ultimately leaves us with a tidy data set (22 variables and 10532 observations) that is relatively concise and convenient for further exploration.
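
The filtering rules described above can be sketched as follows. This is a minimal Python illustration, not the original cleaning script: the four sample records and the helper name `keep_game` are made up, and only the column names year_published and users_rated come from the data set.

```python
# Hypothetical sample records; only the field names come from the data set.
games = [
    {"name": "Game A", "year_published": 1986, "users_rated": 4498},
    {"name": "Game B", "year_published": 1949, "users_rated": 900},   # published too early
    {"name": "Game C", "year_published": 2001, "users_rated": 12},    # too few ratings
    {"name": "Game D", "year_published": None, "users_rated": 300},   # missing year
]


def keep_game(game, first_year=1950, last_year=2016, min_ratings=50):
    """Return True if a game passes the filters described in the text."""
    year = game["year_published"]
    if year is None:  # drop games with a missing publication year
        return False
    in_range = first_year <= year <= last_year
    enough_ratings = game["users_rated"] >= min_ratings
    return in_range and enough_ratings


filtered = [game for game in games if keep_game(game)]
print([game["name"] for game in filtered])
```

Only the first sample record passes all three conditions; the others each fail exactly one of them.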

Analysis and Findings

Initial Data Analysis

  • Initial Data Analysis (IDA) is a process that helps one get a feel for the data in question. It gives an overview of the data and insights for the subsequent Exploratory Data Analysis (EDA).
  • Initial data analysis consists of the data inspection steps carried out after the research plan and data collection have been finished but before formal statistical analyses. The purpose is to minimize the risk of incorrect or misleading results.
  • IDA can be divided into 3 main steps:
    • Data cleaning is the identification of inconsistencies in the data and the resolution of any such issues.
    • Data screening is the description of the data properties.
    • Documentation and reporting preserve the information for the later statistical analysis and models.

Visualization of Data Types

  • The plot above visualizes the distribution of data types in our dataset, with columns on the x-axis and observations on the y-axis. This gives a concise overview of the data and of which columns are useful for analysis. The plot suggests that we can use all the numeric columns along with the designer and publisher columns for our analysis.

Visualization of Missing Values

  • The above plot shows the percentage of missing values and where exactly they occur, with columns on the x-axis and the corresponding observations on the y-axis. Each column is also labelled with its percentage of missing values, which comes in handy when deciding which columns to leave out of the analysis.

  • It is evident that the following columns have missing values and are not of much use for the analysis:

    • compilation - 96.11% missing
    • expansion - 73.87% missing
    • family - 26.66% missing
    • mechanic - 9.02% missing
  • This is a limitation of the dataset and we frame our questions keeping this in mind.
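
The per-column percentages listed above are simply the share of observations with no value in that column. A minimal Python sketch of that calculation follows; the helper name `percent_missing` and the four sample rows are hypothetical, and `None` stands in for NA.

```python
def percent_missing(rows, column):
    """Percentage of rows in which `column` is None or absent."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return 100.0 * missing / len(rows)


# Hypothetical sample rows; only the column names come from the data set.
rows = [
    {"compilation": None, "mechanic": "Dice Rolling"},
    {"compilation": None, "mechanic": "Auction/Bidding"},
    {"compilation": "Some Compilation", "mechanic": None},
    {"compilation": None, "mechanic": "Hand Management"},
]

for column in ("compilation", "mechanic"):
    print(f"{column}: {percent_missing(rows, column):.2f}% missing")
```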

Questions of Interest

1. What are the top 10 ranked board games?

Top 10 ranked board games

name                                             average_rating  max_playtime  min_playtime  max_players  min_players
Small World Designer Edition                     9.00392         80            40            6            2
Kingdom Death: Monster                           8.93184         180           60            6            1
Terra Mystica: Big Box                           8.84862         150           60            5            2
Last Chance for Victory                          8.84603         60            60            2            2
The Greatest Day: Sword, Juno, and Gold Beaches  8.83081         6000          60            8            2
Last Blitzkrieg                                  8.80263         960           180           4            2
Enemy Action: Ardennes                           8.75802         600           0             2            1
Through the Ages: A New Story of Civilization    8.74235         240           180           4            2
1817                                             8.70848         540           360           7            3
Pandemic Legacy: Season 1                        8.66878         60            60            4            2
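
A ranking like the one above comes from sorting the games by average_rating in descending order and keeping the first rows. The Python sketch below illustrates this on five of the games from the table, deliberately listed out of order; the helper name `top_n` is illustrative and not taken from the report's code.

```python
# Five name/rating pairs taken from the table above, listed out of order.
ratings = {
    "Pandemic Legacy: Season 1": 8.66878,
    "Kingdom Death: Monster": 8.93184,
    "1817": 8.70848,
    "Small World Designer Edition": 9.00392,
    "Terra Mystica: Big Box": 8.84862,
}


def top_n(ratings, n):
    """Return the n (name, rating) pairs with the highest rating, best first."""
    ranked = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


for name, rating in top_n(ratings, 3):
    print(f"{rating:.5f}  {name}")
```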

2. How do variables like min/max playtime, min/max players, or min_age affect the average rating in these top-ranked board games?

  • To get a better idea of the common characteristics of top-ranked board games, and to make the findings more reliable, we have widened the range to the top 50 games.

Visualization of Data Types in Top 50 Games

  • The above plot shows the distribution of data types in our top 50 games dataset, with column names on the x-axis and the corresponding observations on the y-axis.

  • It is evident that our selection of columns is appropriate and that there are no missing values in this data. Hence, we need not check for missing values with the vis_miss() function. We can use all these columns to analyse our questions of interest.

  • In plot 1 we can see a few obvious outliers:
    • The Greatest Day: Sword, Juno, and Gold Beaches with a maximum playtime of 6000 minutes and an average rating of 8.8308
    • Axis Empires: Totaler Krieg! with a maximum playtime of 3600 minutes and an average rating of 8.4194
    • Beyond the Rhine with a maximum playtime of 3000 minutes and an average rating of 8.5979
  • It is difficult to examine trends or common characteristics with these outliers present. We have therefore limited the maximum playtime using the IQR outlier rule, which flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. The summary of maximum playtime and the resulting lower and upper limits are:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    82.5   142.5   461.1   345.0  6000.0
## [1] -311.25
## [1] 738.75
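
The two limits printed above follow directly from the quartiles in the summary. The Python sketch below reproduces that arithmetic and applies the upper limit to the three outlier playtimes named earlier; the function name `iqr_fences` is illustrative and is not taken from the report's code.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier limits Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of maximum playtime from the summary above (Q1 = 82.5, Q3 = 345.0).
lower, upper = iqr_fences(82.5, 345.0)
print(lower, upper)  # -311.25 738.75

# The three maximum playtimes singled out as outliers in the text.
outlier_playtimes = [6000, 3600, 3000]
print(all(value > upper for value in outlier_playtimes))  # True
```

Because playtime cannot be negative, only the upper limit has any practical effect here.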

Now we have a clearer picture of where the majority of the top 50 ranked board games lie in the graph of average rating against maximum playtime: most of them have a maximum playtime within 200 minutes.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    45.0    60.0   247.5   180.0  3600.0
## [1] -157.5
## [1] 382.5

  • We applied the same method as before to omit the outliers. The graph shows that most of the top 50 ranked board games have a minimum playtime of less than 100 minutes.

  • In the scatterplot of average rating against minimum players, we observed that most top 50 board games require at least 2 players.

  • In the scatterplot of average rating against maximum players, we observed that most top 50 board games allow a maximum of 4 or 5 players.

  • In the scatterplot of average rating against minimum age of players, we observed that the minimum age set by the majority of board games is between 10 and 15.

Summarizing all observations as Boxplots

  • The insights for the top 50 popular games are summarized in the boxplots above as follows:
    • A maximum of 4 players and a minimum of 2 players is the most common combination among the top 50 games.
    • The maximum and minimum playtimes are close to each other and range between 60 and 150 minutes for the top 50 games.

Relationship between Average Rating and other Attributes

  • The above plot shows the trend of each attribute against average rating, which is on the x-axis. These patterns give a better idea of how the attributes relate to the rating.

  • We can observe the following trends for the top 50 rated games as average rating increases:

    • The minimum number of players tends to be around 2. The maximum number of players tends to be around 4 and increases up to 6.
    • The minimum playtime tends to vary between 60 and 500 minutes. The maximum playtime tends to vary between 150 and 1000 minutes.

  • We can observe the following for the attribute minimum age:

    • Most of the top 50 games set a minimum age between 10 and 15 years.
    • The trend suggests that games are more popular among the 7-13 age group.

3. Which game designer was most successful in producing popular games? Which publisher published the most popular games?

Top 10 Game Designers

  • The above scatterplot has average rating on the x-axis and designer on the y-axis. The black vertical line marks the mean of the average ratings of the top 10 designers. The plot shows that this mean is around 8.82, with 5 observations on either side of the line.

  • Philippe Keyaerts has the highest rated game, at just above 9, followed by Vlaada Chvatil at around 8.93, with all the other designers falling close to the mean. The lowest rated entry in the top 10 is the game designed by Rob Daviau and Matt Leacock. We should note that Dean Essig has two games in the top 10.

  • Who among these is the best is still debatable. Some might say it is Dean Essig, while others might pick Philippe Keyaerts. Nevertheless, all of the designers in the plot are among the top 10 and have produced the most popular games.

Top 7 Game Publishers

  • The above scatter-plot consists of average rating on x-axis and publisher on y-axis.

  • The first thing that stands out in this plot is that Multi-Man Publishing has 3 of the top 7 rated board games, which hints that it is one of the best publishers.

  • The top rated game was published by Days of Wonder, and the lowest rated game in the top 7 was published by Compass Games. Again, it is debatable who is best, but the publishers in the above plot have published some of the finest board games.

Bonus Insight - An interesting takeaway from the above two plots is that the top rated board games were launched between 2010 and 2016, with most of them launched in 2015.

References